Data-Driven Analysis of Hole-Transporting Materials for Perovskite Solar Cells Performance

We have created a dataset of 269 perovskite solar cells, containing information about their perovskite family, cell architecture, and multiple hole-transporting material features, including fingerprints, additives, and structural and electronic features. We propose a predictive machine learning model that is trained on these data and can be used to screen candidate hole-transporting materials. Our approach predicts the performance of perovskite solar cells with reasonable accuracy and successfully identifies most of the top-performing and lowest-performing hole-transporting materials in the dataset. We discuss how biases in the distribution of perovskite families and architectures in the data affect the model’s accuracy, and we offer an analysis on a subset of the data to isolate the effect of the hole-transporting material on solar cell performance. Finally, we discuss some chemical fragments, like arylamine and aryloxy groups, which present a relatively large positive correlation with the power conversion efficiency (PCE) of the cell, whereas other groups, like thiophene groups, display a negative correlation with PCE.

The correlation matrix of all these features is shown in Figure S1 (for the heterogeneous database). We then remove those features with a Pearson correlation larger than 0.7:
- molecular weight (large correlation with rotatable bonds)
- fluorenes (large correlation with aliphatic carbocycles)
- aliphatic rings (large correlation with aliphatic heterocycles)
- thiophenes (large correlation with aromatic heterocycles)
- aromatic rings (large correlation with aromatic carbocycles)
- rings (large correlation with aromatic carbocycles)
- heterocycles (large correlation with aromatic heterocycles)
- bridgehead atoms (large correlation with phthalocyanines)

Figure S1. Correlation matrix with the initial 32 structural features in the heterogeneous dataset.
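The correlation filter described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes a greedy rule that keeps the first feature of each highly correlated pair (the text instead names the specific feature removed from each pair), and the feature names and values in `toy` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.7):
    """Keep a feature only if |r| with every already-kept feature is <= threshold.

    `features` maps a feature name to its list of values (one per molecule).
    Returns the kept names and a dict {dropped name: kept feature it correlates with}.
    """
    kept, dropped = [], {}
    for name, values in features.items():
        partner = next(
            (k for k in kept if abs(pearson(values, features[k])) > threshold),
            None,
        )
        if partner is None:
            kept.append(name)
        else:
            dropped[name] = partner
    return kept, dropped

# Hypothetical toy values for five molecules (assumption, not data from the paper).
toy = {
    "rotatable_bonds": [2, 4, 6, 8, 10],
    "molecular_weight": [210, 395, 610, 805, 1000],
    "spiro_atoms": [1, 0, 1, 0, 1],
}
kept, dropped = drop_correlated(toy)
print("kept:", kept)
print("dropped:", dropped)
```

In this toy case, molecular weight tracks the number of rotatable bonds almost linearly and is dropped, mirroring the first entry in the list above.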

Finally, we used a k-nearest neighbors model with all 269 data points to estimate the root mean square error (rmse) under 10-fold cross-validation. We performed a recursive feature selection, whose results are shown in Figure S2, in which the following descriptors are dropped sequentially: spiro atoms, amide bonds, azulene, diphenylamine, xanthene, aromatic carbocycles, heteroatoms, aromatic heterocycles, molecular planarity, sp3

In the case of the homogeneous database, there is only one molecule with a phthalocyanine group, and there are no porphyrins present, so we do not consider these two features. In Figure S3, we show the resulting correlation matrix with the remaining 30 structural features. We remove the following nine features with a Pearson correlation larger than 0.

We use the same procedure as in the heterogeneous database to perform a recursive feature selection on the remaining 21 structural features of the homogeneous database, as shown in Figure S4, where we drop the following features sequentially: diphenylamine, amide bonds, carbazole, aromatic carbocycles, bridgehead atoms, silicon atoms, xanthene, furan, molecular planarity, polymer, aromatic heterocycles, aliphatic heterocycles, stereocenters, heteroatoms, aliphatic carbocycles, benzotrithiophene, rotatable bonds, spiro atoms, triphenylamine. The optimum number of structural features is reached with nine features:
1. Spiro atoms
2. Carbazole
3. Triphenylamine
4. Acenaphthene
5. Benzotrithiophene
6. sp3 Carbons
7. Aliphatic heterocycles
8. Aromatic heterocycles
9. Molecular planarity (PBF) 1

Figure S4. rmse values obtained with a recursive feature elimination in the homogeneous database.
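The recursive feature selection described above can be sketched as follows. This is only an illustration under stated assumptions: the number of neighbors (k = 3), the unweighted Euclidean kNN regressor, the assignment of folds by index, and the rule of dropping at each step the feature whose removal gives the lowest rmse are all choices made here, not details given in the text. The toy descriptors and target values are hypothetical.

```python
import math

def knn_predict(train_X, train_y, x, k=3):
    """Mean target of the k training points closest to x (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_rmse(X, y, cols, k=3, n_folds=10):
    """rmse of kNN predictions under n-fold cross-validation, using only columns `cols`."""
    Xs = [[row[c] for c in cols] for row in X]
    sq_err = 0.0
    for fold in range(n_folds):
        # Assumption: point i belongs to fold i % n_folds.
        test_idx = [i for i in range(len(Xs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(Xs)) if i % n_folds != fold]
        train_X = [Xs[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            sq_err += (knn_predict(train_X, train_y, Xs[i], k) - y[i]) ** 2
    return math.sqrt(sq_err / len(Xs))

def recursive_elimination(X, y, names):
    """Drop one feature per step, removing the one whose removal gives the lowest rmse.

    Returns a list of (remaining feature names, cross-validated rmse) per step.
    """
    cols = list(range(len(names)))
    history = [([names[c] for c in cols], cv_rmse(X, y, cols))]
    while len(cols) > 1:
        trials = [(cv_rmse(X, y, [c for c in cols if c != d]), d) for d in cols]
        best_rmse, drop = min(trials)
        cols.remove(drop)
        history.append(([names[c] for c in cols], best_rmse))
    return history

# Hypothetical toy data: one informative descriptor and two uninformative binary ones.
names = ["informative", "noise_a", "noise_b"]
X = [[i, i % 2, (i // 2) % 2] for i in range(40)]
y = [0.5 * i for i in range(40)]  # stand-in for PCE values
history = recursive_elimination(X, y, names)
for kept_names, rmse in history:
    print(len(kept_names), kept_names, round(rmse, 3))
```

Plotting the rmse in `history` against the number of remaining features gives a curve of the kind shown in Figures S2 and S4, from which the optimum number of features can be read.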

b. Electronic features

Figure S5. Ionization potential (IP) and HOMO energy data in the heterogeneous database.

S2. Conformer search
Starting from the SMILES string of each HTM, we calculated the ensemble of the ten most stable conformers with a UFF forcefield, using Open Babel. 2 Then, we optimized these ten geometries using PM7 3 with Gaussian16, 4 and the most stable geometry was used as a starting point for the subsequent density functional theory (DFT) calculations.

S3. ML model

a. Differential evolution algorithm
We used a differential evolution algorithm, 5 as implemented in SciPy, 6 using a population size of 15 per parameter, a recombination rate of 0.7 and a mutation of 0.5-1.0.
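The quoted settings correspond to the `popsize`, `recombination` and `mutation` arguments of `scipy.optimize.differential_evolution`. To illustrate what these parameters control, the following is a minimal pure-Python sketch of a best1bin differential evolution with dithering. It is not the SciPy implementation: the number of generations, the clipping to the bounds, and the quadratic objective with its bounds are hypothetical stand-ins, since the quantity optimized is not specified in this section.

```python
import random

def differential_evolution(func, bounds, popsize=15, recombination=0.7,
                           mutation=(0.5, 1.0), generations=200, seed=0):
    """Minimal best1bin differential evolution (illustrative sketch only)."""
    rng = random.Random(seed)
    dim = len(bounds)
    n_pop = popsize * dim  # population size of 15 per parameter
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_pop)]
    scores = [func(ind) for ind in pop]
    for _ in range(generations):
        f = rng.uniform(*mutation)  # dithering: mutation constant drawn from 0.5-1.0
        best = pop[min(range(n_pop), key=scores.__getitem__)]
        for i in range(n_pop):
            # Two distinct population members, both different from i.
            a, b = rng.sample([j for j in range(n_pop) if j != i], 2)
            forced = rng.randrange(dim)  # at least one component is always mutated
            trial = list(pop[i])
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < recombination:
                    value = best[d] + f * (pop[a][d] - pop[b][d])
                    trial[d] = min(max(value, lo), hi)  # clip to the bounds
            trial_score = func(trial)
            if trial_score <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, trial_score
    i_best = min(range(n_pop), key=scores.__getitem__)
    return pop[i_best], scores[i_best]

def objective(p):
    """Hypothetical objective with its minimum at (1, -2)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

best_params, best_score = differential_evolution(objective, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_params, best_score)
```

In practice the objective would be a cross-validated error of the ML model as a function of its hyperparameters, which is an assumption about usage rather than something stated here.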

S4. kNN results
Using the k-nearest neighbours (kNN) algorithm, we obtain the following results with the homogeneous dataset, which show a similar trend to the KRR results.

Figure S7. Experimental and predicted PCE of data in the homogeneous database using kNN, when using different types of features.

S5. Chemical fragments correlation with PCE
Given that most of the model performance is due to the fingerprints, we can analyse which bits of the fingerprints are most correlated with PCE.
For each bit, we created an array of length N (where N is the number of molecules in the dataset), with values of either 0 (bit absent in that molecule) or 1 (bit present in that molecule). We then studied the correlation of each of these arrays with an array containing the PCE values of each molecule in the dataset. We used the point biserial correlation coefficient (r), which is equivalent to Pearson's correlation when one variable is binary. We give this value, its corresponding p-value (the probability of observing the same or larger |r| if the data are uncorrelated) and the 95% confidence interval below, for the 10 fragments with the largest |r| values. We also report the Mann-Whitney U test values (U1 and U2): for example, a large value of U1 indicates that there is a significant difference between the 0 and 1 arrays, with 1 having larger values.
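As a worked illustration of the coefficient used here, the sketch below computes the point biserial correlation from the group means and checks that it coincides with Pearson's correlation on the same data. The bit and PCE arrays are hypothetical; the p-value and confidence interval are not computed here and would typically be obtained from a statistics library (for instance, `scipy.stats.pointbiserialr` returns r together with a p-value).

```python
import math

def point_biserial(bits, values):
    """Point biserial correlation between a 0/1 array and a continuous array."""
    n = len(bits)
    group1 = [v for b, v in zip(bits, values) if b == 1]  # bit present
    group0 = [v for b, v in zip(bits, values) if b == 0]  # bit absent
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        raise ValueError("bit must be present in some molecules and absent in others")
    mean1, mean0 = sum(group1) / n1, sum(group0) / n0
    mean_all = sum(values) / n
    # Population standard deviation of all values (divides by n, not n - 1).
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / s_n * math.sqrt(n1 * n0 / n ** 2)

def pearson(x, y):
    """Pearson correlation coefficient, used here only as a cross-check."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical example: one fingerprint bit and PCE values (%) for six molecules.
bits = [1, 1, 0, 0, 1, 0]
pce = [18.0, 16.5, 12.0, 14.0, 17.5, 11.5]

r_pb = point_biserial(bits, pce)
r_pearson = pearson(bits, pce)
print(r_pb, r_pearson)
```

A positive r means that molecules containing the fragment tend to have a higher PCE than those without it, and a negative r the opposite.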

The results for the 19 molecules with |r| > 0.3 in the homogeneous database are shown in Table S3, and the results for the 27 molecules with |r| > 0.2 in the heterogeneous database are shown in Table S4.

Table S5. Results for the bits whose correlation coefficient |r| > 0.2 with respect to PCE for all molecules in the heterogeneous dataset, with their corresponding p-values and 95% confidence intervals, as well as the Mann-Whitney U test values (U1 and U2) and the associated p-value for each fragment.
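The Mann-Whitney U values reported in these tables can be illustrated with the short sketch below. It assumes the common convention in which U1 counts the (bit present, bit absent) pairs where the molecule containing the fragment has the larger PCE, with ties counted as one half, and U2 is the complementary count; the convention used for the tables is not stated in the text, and the data are hypothetical. The associated p-value is not computed here.

```python
def mann_whitney_u(bits, values):
    """Return (U1, U2) for the groups with the bit present (1) and absent (0).

    U1 counts pairs in which the bit-present value exceeds the bit-absent value
    (ties count 0.5); U2 = n1 * n0 - U1 is the complementary count.
    """
    present = [v for b, v in zip(bits, values) if b == 1]
    absent = [v for b, v in zip(bits, values) if b == 0]
    u1 = sum(
        1.0 if p > a else 0.5 if p == a else 0.0
        for p in present
        for a in absent
    )
    u2 = len(present) * len(absent) - u1
    return u1, u2

# Hypothetical example: one fingerprint bit and PCE values (%) for six molecules.
bits = [1, 1, 0, 0, 1, 0]
pce = [18.0, 16.5, 12.0, 14.0, 17.5, 11.5]

u1, u2 = mann_whitney_u(bits, pce)
print(u1, u2)  # every bit-present PCE exceeds every bit-absent PCE: 9.0 0.0
```

Here U1 reaches its maximum value n1 * n0 = 9, the extreme case in which the array with the bit present has systematically larger PCE values.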